Predicting coffee cupping scores

Revisiting the coffee dataset

lruolin
06-25-2022

Background

Revisiting the dataset to see if total cupping score can be predicted from the variables in the dataset.

What I know so far:

Key Objectives and Learning Points

Most of the exercises I encountered online use the 10 attributes as part of prediction for the total cupping score, but I decided to leave them out due to the way it is calculated. I wanted to look at whether other variables, such as country of origin, processing methods, moisture etc can predict the total cupping score.

However, towards the end of the exercises, when I was evaluating the model, I realised that my model is really quite sub-optimal. The r-sq was ridiculously low, and the rmse is about 3.

What could the reason be? Perhaps the choice of variables was not suitable, perhaps I did not engineer my features well enough? I think I already chose the easiest algorithm, as the dataset was very skewed and had a lot of categories, so I decided to use random forest, which doesn’t require a lot of feature engineering…

Perhaps these variables are not enough to predict the cupping score? Quality isn’t just a measure of country, region, processing methods, although they can give some indications as to whether the coffee beans are good… I’m probably missing some additional information…

Other exercises that used the 10 attributes had very good R-sq values, and understandably so, because total cupping points are calculated from these attributes by summing them up…

Nevertheless, I shall take it as an exercise using the tidymodels framework, and let me mull over how my ML can be improved…

Loading packages

library(pacman)
p_load(#qacData, 
       tidyverse, lubridate,
       janitor, skimr,
       ggsci, ggthemes, ggthemr,
       DT,
       tidymodels,
       GGally,
       doParallel, ranger,
       plotly,
       vip)

ggthemr::ggthemr("fresh")

Loading the data

coffee_ratings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-07/coffee_ratings.csv')

Data wrangling

Extracting column names

# extract col names-----
coffee_col_names <- colnames(coffee_ratings) %>% 
  as_tibble(.)

coffee_col_names %>% 
  print(n = nrow(.))
# A tibble: 43 × 1
   value                
   <chr>                
 1 total_cup_points     
 2 species              
 3 owner                
 4 country_of_origin    
 5 farm_name            
 6 lot_number           
 7 mill                 
 8 ico_number           
 9 company              
10 altitude             
11 region               
12 producer             
13 number_of_bags       
14 bag_weight           
15 in_country_partner   
16 harvest_year         
17 grading_date         
18 owner_1              
19 variety              
20 processing_method    
21 aroma                
22 flavor               
23 aftertaste           
24 acidity              
25 body                 
26 balance              
27 uniformity           
28 clean_cup            
29 sweetness            
30 cupper_points        
31 moisture             
32 category_one_defects 
33 quakers              
34 color                
35 category_two_defects 
36 expiration           
37 certification_body   
38 certification_address
39 certification_contact
40 unit_of_measurement  
41 altitude_low_meters  
42 altitude_high_meters 
43 altitude_mean_meters 

Initial look at the data

skim(coffee_ratings)
Table 1: Data summary
Name coffee_ratings
Number of rows 1339
Number of columns 43
_______________________
Column type frequency:
character 24
numeric 19
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
species 0 1.00 7 7 0 2 0
owner 7 0.99 3 50 0 315 0
country_of_origin 1 1.00 4 28 0 36 0
farm_name 359 0.73 1 73 0 571 0
lot_number 1063 0.21 1 71 0 227 0
mill 315 0.76 1 77 0 460 0
ico_number 151 0.89 1 40 0 847 0
company 209 0.84 3 73 0 281 0
altitude 226 0.83 1 41 0 396 0
region 59 0.96 2 76 0 356 0
producer 231 0.83 1 100 0 691 0
bag_weight 0 1.00 1 8 0 56 0
in_country_partner 0 1.00 7 85 0 27 0
harvest_year 47 0.96 3 24 0 46 0
grading_date 0 1.00 13 20 0 567 0
owner_1 7 0.99 3 50 0 319 0
variety 226 0.83 4 21 0 29 0
processing_method 170 0.87 5 25 0 5 0
color 218 0.84 4 12 0 4 0
expiration 0 1.00 13 20 0 566 0
certification_body 0 1.00 7 85 0 26 0
certification_address 0 1.00 40 40 0 32 0
certification_contact 0 1.00 40 40 0 29 0
unit_of_measurement 0 1.00 1 2 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
total_cup_points 0 1.00 82.09 3.50 0 81.08 82.50 83.67 90.58 ▁▁▁▁▇
number_of_bags 0 1.00 154.18 129.99 0 14.00 175.00 275.00 1062.00 ▇▇▁▁▁
aroma 0 1.00 7.57 0.38 0 7.42 7.58 7.75 8.75 ▁▁▁▁▇
flavor 0 1.00 7.52 0.40 0 7.33 7.58 7.75 8.83 ▁▁▁▁▇
aftertaste 0 1.00 7.40 0.40 0 7.25 7.42 7.58 8.67 ▁▁▁▁▇
acidity 0 1.00 7.54 0.38 0 7.33 7.58 7.75 8.75 ▁▁▁▁▇
body 0 1.00 7.52 0.37 0 7.33 7.50 7.67 8.58 ▁▁▁▁▇
balance 0 1.00 7.52 0.41 0 7.33 7.50 7.75 8.75 ▁▁▁▁▇
uniformity 0 1.00 9.83 0.55 0 10.00 10.00 10.00 10.00 ▁▁▁▁▇
clean_cup 0 1.00 9.84 0.76 0 10.00 10.00 10.00 10.00 ▁▁▁▁▇
sweetness 0 1.00 9.86 0.62 0 10.00 10.00 10.00 10.00 ▁▁▁▁▇
cupper_points 0 1.00 7.50 0.47 0 7.25 7.50 7.75 10.00 ▁▁▁▇▁
moisture 0 1.00 0.09 0.05 0 0.09 0.11 0.12 0.28 ▃▇▅▁▁
category_one_defects 0 1.00 0.48 2.55 0 0.00 0.00 0.00 63.00 ▇▁▁▁▁
quakers 1 1.00 0.17 0.83 0 0.00 0.00 0.00 11.00 ▇▁▁▁▁
category_two_defects 0 1.00 3.56 5.31 0 0.00 2.00 4.00 55.00 ▇▁▁▁▁
altitude_low_meters 230 0.83 1750.71 8669.44 1 1100.00 1310.64 1600.00 190164.00 ▇▁▁▁▁
altitude_high_meters 230 0.83 1799.35 8668.81 1 1100.00 1350.00 1650.00 190164.00 ▇▁▁▁▁
altitude_mean_meters 230 0.83 1775.03 8668.63 1 1100.00 1310.64 1600.00 190164.00 ▇▁▁▁▁
# count -----
coffee_ratings %>% 
  count(species) # mostly arabica
# A tibble: 2 × 2
  species     n
  <chr>   <int>
1 Arabica  1311
2 Robusta    28
coffee_ratings %>% 
  count(owner) %>% 
  arrange(desc(n))  # 316 unique owners
# A tibble: 316 × 2
   owner                                n
   <chr>                            <int>
 1 juan luis alvarado romero          155
 2 racafe & cia s.c.a                  60
 3 exportadora de cafe condor s.a      54
 4 kona pacific farmers cooperative    52
 5 ipanema coffees                     50
 6 cqi taiwan icp cqi台灣合作夥伴      47
 7 lin, che-hao krude 林哲豪           30
 8 nucoffee                            29
 9 carcafe ltda ci                     27
10 the coffee source inc.              23
# … with 306 more rows
# create a function to count and sort
count_and_sort <- function(col){
  coffee_ratings %>% 
    count({{col}}) %>% 
    arrange(desc(n))
}

# 37 unique countries
count_and_sort(country_of_origin)
# A tibble: 37 × 2
   country_of_origin                n
   <chr>                        <int>
 1 Mexico                         236
 2 Colombia                       183
 3 Guatemala                      181
 4 Brazil                         132
 5 Taiwan                          75
 6 United States (Hawaii)          73
 7 Honduras                        53
 8 Costa Rica                      51
 9 Ethiopia                        44
10 Tanzania, United Republic Of    40
# … with 27 more rows
count_and_sort(farm_name) # many missing values for farm name, 572 unique farms
# A tibble: 572 × 2
   farm_name                        n
   <chr>                        <int>
 1 <NA>                           359
 2 various                         47
 3 rio verde                       23
 4 several                         20
 5 finca medina                    15
 6 doi tung development project    13
 7 fazenda capoeirnha              13
 8 conquista / morito              11
 9 los hicaques                    11
10 capoeirinha                     10
# … with 562 more rows
count_and_sort(lot_number)
# A tibble: 228 × 2
   lot_number                                                        n
   <chr>                                                         <int>
 1 <NA>                                                           1063
 2 1                                                                18
 3 020/17                                                            6
 4 019/17                                                            5
 5 102                                                               3
 6 103                                                               3
 7 2                                                                 3
 8 2016 Tainan Coffee Cupping Event Micro Lot 臺南市咖啡評鑑批次     3
 9 017-053-0046                                                      2
10 1-198                                                             2
# … with 218 more rows
count_and_sort(owner)
# A tibble: 316 × 2
   owner                                n
   <chr>                            <int>
 1 juan luis alvarado romero          155
 2 racafe & cia s.c.a                  60
 3 exportadora de cafe condor s.a      54
 4 kona pacific farmers cooperative    52
 5 ipanema coffees                     50
 6 cqi taiwan icp cqi台灣合作夥伴      47
 7 lin, che-hao krude 林哲豪           30
 8 nucoffee                            29
 9 carcafe ltda ci                     27
10 the coffee source inc.              23
# … with 306 more rows
count_and_sort(farm_name)
# A tibble: 572 × 2
   farm_name                        n
   <chr>                        <int>
 1 <NA>                           359
 2 various                         47
 3 rio verde                       23
 4 several                         20
 5 finca medina                    15
 6 doi tung development project    13
 7 fazenda capoeirnha              13
 8 conquista / morito              11
 9 los hicaques                    11
10 capoeirinha                     10
# … with 562 more rows
count_and_sort(species)
# A tibble: 2 × 2
  species     n
  <chr>   <int>
1 Arabica  1311
2 Robusta    28
count_and_sort(mill)
# A tibble: 461 × 2
   mill                                n
   <chr>                           <int>
 1 <NA>                              315
 2 beneficio ixchel                   90
 3 dry mill                           39
 4 trilladora boananza                38
 5 ipanema coffees                    16
 6 neiva                              15
 7 bachue                             14
 8 beneficio siembras vision (154)    12
 9 cadexsa                            12
10 cigrah s.a de c.v.                 12
# … with 451 more rows
count_and_sort(ico_number)
# A tibble: 848 × 2
   ico_number        n
   <chr>         <int>
 1 <NA>            151
 2 0                77
 3 Taiwan           31
 4 2222             11
 5 -                 9
 6 002/1660/0105     7
 7 002/4177/0150     7
 8 Taiwan台灣        7
 9 002/1660/0080     6
10 1                 6
# … with 838 more rows
count_and_sort(company)
# A tibble: 282 × 2
   company                              n
   <chr>                            <int>
 1 <NA>                               209
 2 unex guatemala, s.a.                86
 3 ipanema coffees                     50
 4 exportadora de cafe condor s.a      40
 5 kona pacific farmers cooperative    40
 6 racafe & cia s.c.a                  40
 7 blossom valley宸嶧國際              25
 8 carcafe ltda                        25
 9 nucoffee                            24
10 taiwan coffee laboratory            20
# … with 272 more rows
count_and_sort(altitude)
# A tibble: 397 × 2
   altitude     n
   <chr>    <int>
 1 <NA>       226
 2 1100        43
 3 1200        42
 4 1300        32
 5 1400        32
 6 4300        31
 7 1250        30
 8 1500        30
 9 1700        28
10 1550        24
# … with 387 more rows
count_and_sort(region)
# A tibble: 357 × 2
   region             n
   <chr>          <int>
 1 huila            112
 2 oriente           80
 3 south of minas    68
 4 kona              66
 5 <NA>              59
 6 veracruz          35
 7 tarrazu           19
 8 comayagua         17
 9 huehuetenango     16
10 san marcos        16
# … with 347 more rows
count_and_sort(producer)
# A tibble: 692 × 2
   producer                         n
   <chr>                        <int>
 1 <NA>                           231
 2 La Plata                        30
 3 Ipanema Agrícola SA             22
 4 Doi Tung Development Project    17
 5 Ipanema Agricola                12
 6 VARIOS                          12
 7 Ipanema Agricola S.A            11
 8 ROBERTO MONTERROSO              10
 9 AMILCAR LAPOLA                   9
10 LA PLATA                         9
# … with 682 more rows
count_and_sort(bag_weight)
# A tibble: 56 × 2
   bag_weight     n
   <chr>      <int>
 1 1 kg         331
 2 60 kg        256
 3 69 kg        200
 4 70 kg        156
 5 2 kg         122
 6 100 lbs       59
 7 30 kg         29
 8 5 lbs         23
 9 6             19
10 20 kg         14
# … with 46 more rows
count_and_sort(in_country_partner)
# A tibble: 27 × 2
   in_country_partner                             n
   <chr>                                      <int>
 1 Specialty Coffee Association                 313
 2 AMECAFE                                      205
 3 Almacafé                                     178
 4 Asociacion Nacional Del Café                 155
 5 Brazil Specialty Coffee Association           67
 6 Instituto Hondureño del Café                  60
 7 Blossom Valley International                  58
 8 Africa Fine Coffee Association                49
 9 Specialty Coffee Association of Costa Rica    42
10 NUCOFFEE                                      36
# … with 17 more rows
count_and_sort(harvest_year)
# A tibble: 47 × 2
   harvest_year     n
   <chr>        <int>
 1 2012           354
 2 2014           233
 3 2013           181
 4 2015           129
 5 2016           124
 6 2017            70
 7 <NA>            47
 8 2013/2014       29
 9 2015/2016       28
10 2011            26
# … with 37 more rows
count_and_sort(grading_date)
# A tibble: 567 × 2
   grading_date             n
   <chr>                <int>
 1 July 11th, 2012         25
 2 December 26th, 2013     24
 3 June 6th, 2012          19
 4 August 30th, 2012       18
 5 July 26th, 2012         15
 6 March 29th, 2013        13
 7 October 8th, 2015       13
 8 September 27th, 2012    13
 9 June 17th, 2010         12
10 October 20th, 2017      11
# … with 557 more rows
count_and_sort(owner_1)
# A tibble: 320 × 2
   owner_1                              n
   <chr>                            <int>
 1 Juan Luis Alvarado Romero          155
 2 Racafe & Cia S.C.A                  60
 3 Exportadora de Cafe Condor S.A      54
 4 Kona Pacific Farmers Cooperative    52
 5 Ipanema Coffees                     50
 6 CQI Taiwan ICP CQI台灣合作夥伴      46
 7 Lin, Che-Hao Krude 林哲豪           29
 8 NUCOFFEE                            29
 9 CARCAFE LTDA CI                     27
10 The Coffee Source Inc.              23
# … with 310 more rows
count_and_sort(variety)
# A tibble: 30 × 2
   variety            n
   <chr>          <int>
 1 Caturra          256
 2 Bourbon          226
 3 <NA>             226
 4 Typica           211
 5 Other            110
 6 Catuai            74
 7 Hawaiian Kona     44
 8 Yellow Bourbon    35
 9 Mundo Novo        33
10 Catimor           20
# … with 20 more rows
count_and_sort(processing_method)
# A tibble: 6 × 2
  processing_method             n
  <chr>                     <int>
1 Washed / Wet                815
2 Natural / Dry               258
3 <NA>                        170
4 Semi-washed / Semi-pulped    56
5 Other                        26
6 Pulped natural / honey       14
count_and_sort(color)
# A tibble: 5 × 2
  color            n
  <chr>        <int>
1 Green          870
2 <NA>           218
3 Bluish-Green   114
4 Blue-Green      85
5 None            52
count_and_sort(expiration)
# A tibble: 566 × 2
   expiration               n
   <chr>                <int>
 1 December 26th, 2014     25
 2 July 11th, 2013         25
 3 June 6th, 2013          19
 4 August 30th, 2013       18
 5 July 26th, 2013         15
 6 March 29th, 2014        13
 7 October 7th, 2016       13
 8 September 27th, 2013    13
 9 June 17th, 2011         12
10 October 20th, 2018      11
# … with 556 more rows
count_and_sort(certification_body) %>% datatable()
count_and_sort(unit_of_measurement)
# A tibble: 2 × 2
  unit_of_measurement     n
  <chr>               <int>
1 m                    1157
2 ft                    182

Robusta vs Arabica

# robusta vs arabica -----

coffee_ratings %>% 
  ggplot(aes(total_cup_points, fill = species)) +
  geom_density(alpha = 0.4)

Change to correct columns and remove redundant columns

Filter to just include Arabica, since Robusta is traditionally a more inferior grade

coffee_a <- 
  coffee_ratings %>% 
  dplyr::filter(species == "Arabica") %>%
  mutate(across(where(is.character), as.factor)) %>% 
  mutate_at(c("altitude"), as.numeric) %>% 
  mutate_at(c("harvest_year"), as.character) %>% 
  mutate(grading_date = lubridate::mdy(grading_date)) %>% 
  mutate(expiration = lubridate::mdy(expiration)) %>% 
  # collapse less than 10 into 1 category
  mutate(certification_body = fct_lump_min(certification_body,10)) %>% 
  select(-lot_number,
         -ico_number,
         -bag_weight,
         -farm_name,
         -mill,
         -producer,
         -owner_1,
         -certification_address,
         -certification_contact
  )

Next, to clean up the harvest years.

harvest_years <- 
  coffee_a %>% 
  select(harvest_year, 
         # grading_date
  ) %>% 
  unique() %>% 
  print(nrow = nrow(.))
# A tibble: 47 × 1
   harvest_year          
   <chr>                 
 1 2014                  
 2 <NA>                  
 3 2013                  
 4 2012                  
 5 March 2010            
 6 Sept 2009 - April 2010
 7 May-August            
 8 2009/2010             
 9 2015                  
10 2011                  
# … with 37 more rows
harvest_years %>% datatable()
coffee_b <- 
  coffee_a %>% 
  filter(!harvest_year %in% c("May-August",
                              "mmm",
                              "TEST",
                              "January Through April",
                              "August to December",
                              "Mayo a Julio",
                              "Abril - Julio")) %>% 
  mutate(harvest_year = str_to_lower(harvest_year),
         harvest_year = str_replace_all(harvest_year, "[[:punct:]]", " "),
         #harvest_year_num = str_extract(harvest_year, "\\d{4}"))

         harvest_year = str_replace_all(harvest_year, "4t 10", "2010"),
         harvest_year = str_replace_all(harvest_year, "4t 2011", "2011"),
         harvest_year = str_replace_all(harvest_year, "4t 2010", "2010"),
         harvest_year = str_replace_all(harvest_year, "47 2010", "2010"),
         harvest_year = str_replace_all(harvest_year, "23 july 2010", "2010"),
         harvest_year = str_replace_all(harvest_year, "3t 2011", "2011"),
         harvest_year = str_replace_all(harvest_year, "4t72010", "2010"),
         harvest_year = str_replace_all(harvest_year, "08 09 crop", "2009"),
         harvest_year = str_replace_all(harvest_year, "1t 2011", "2011"),
         harvest_year_num = as.factor(parse_number(harvest_year)))

# check       
coffee_b %>% 
  select(harvest_year, harvest_year_num) %>% 
  unique() %>% 
  datatable()
ggplot(coffee_b,
       aes(x = harvest_year_num,
           y = total_cup_points)) +
  geom_boxplot() +
  ggtitle("Harvest years")
# leave it as it is
coffee_b %>% 
  # filter(is.na(harvest_year_num)) %>% 
  select(harvest_year, grading_date)
# A tibble: 1,301 × 2
   harvest_year grading_date
   <chr>        <date>      
 1 2014         2015-04-04  
 2 2014         2015-04-04  
 3 <NA>         2010-05-31  
 4 2014         2015-03-26  
 5 2014         2015-04-04  
 6 2013         2013-09-03  
 7 2012         2012-09-17  
 8 march 2010   2010-09-02  
 9 march 2010   2010-09-02  
10 2014         2015-03-30  
# … with 1,291 more rows

There are some obvious outliers for altitude, which should be removed.

altitude <- coffee_b %>% 
  select(starts_with("altitude"), unit_of_measurement )

altitude %>% datatable()
coffee_c <- 
  coffee_b %>% 
  select(-altitude_low_meters, 
         -altitude_high_meters, -harvest_year,
         -altitude, -unit_of_measurement) %>% 
  # remove obvious outliers
  filter(altitude_mean_meters < 5000 & altitude_mean_meters > 50)

coffee_c %>% 
  ggplot(aes(altitude_mean_meters)) +
  geom_boxplot()
skim(coffee_c)
Table 2: Data summary
Name coffee_c
Number of rows 1054
Number of columns 30
_______________________
Column type frequency:
Date 2
factor 11
numeric 17
________________________
Group variables None

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
grading_date 0 1 2010-04-09 2018-01-19 2014-04-30 456
expiration 0 1 2011-04-09 2019-01-19 2015-04-30 456

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
species 0 1.00 FALSE 1 Ara: 1054
owner 7 0.99 FALSE 277 jua: 133, exp: 53, cqi: 41, ipa: 39
country_of_origin 0 1.00 FALSE 35 Mex: 230, Gua: 154, Col: 145, Bra: 91
company 141 0.87 FALSE 244 une: 71, exp: 39, ipa: 39, rac: 26
region 3 1.00 FALSE 322 hui: 91, ori: 65, sou: 55, ver: 31
in_country_partner 0 1.00 FALSE 26 AME: 203, Spe: 173, Alm: 144, Aso: 133
variety 83 0.92 FALSE 27 Cat: 234, Typ: 207, Bou: 205, Oth: 98
processing_method 69 0.93 FALSE 5 Was: 731, Nat: 167, Sem: 52, Oth: 25
color 129 0.88 FALSE 4 Gre: 732, Blu: 81, Blu: 68, Non: 44
certification_body 0 1.00 FALSE 19 AME: 203, Spe: 173, Alm: 144, Aso: 133
harvest_year_num 12 0.99 FALSE 10 201: 292, 201: 216, 201: 171, 201: 128

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
total_cup_points 0 1 82.11 3.67 0 81.17 82.50 83.67 90.58 ▁▁▁▁▇
number_of_bags 0 1 156.21 128.55 0 18.00 200.00 275.00 1062.00 ▇▇▁▁▁
aroma 0 1 7.57 0.39 0 7.42 7.58 7.75 8.75 ▁▁▁▁▇
flavor 0 1 7.52 0.41 0 7.33 7.50 7.75 8.83 ▁▁▁▁▇
aftertaste 0 1 7.39 0.41 0 7.25 7.42 7.58 8.67 ▁▁▁▁▇
acidity 0 1 7.53 0.39 0 7.33 7.50 7.75 8.75 ▁▁▁▁▇
body 0 1 7.50 0.36 0 7.33 7.50 7.67 8.58 ▁▁▁▁▇
balance 0 1 7.50 0.42 0 7.33 7.50 7.67 8.75 ▁▁▁▁▇
uniformity 0 1 9.86 0.53 0 10.00 10.00 10.00 10.00 ▁▁▁▁▇
clean_cup 0 1 9.85 0.80 0 10.00 10.00 10.00 10.00 ▁▁▁▁▇
sweetness 0 1 9.92 0.52 0 10.00 10.00 10.00 10.00 ▁▁▁▁▇
cupper_points 0 1 7.48 0.47 0 7.25 7.50 7.75 10.00 ▁▁▁▇▁
moisture 0 1 0.09 0.04 0 0.10 0.11 0.12 0.20 ▂▁▇▁▁
category_one_defects 0 1 0.38 1.88 0 0.00 0.00 0.00 31.00 ▇▁▁▁▁
quakers 1 1 0.14 0.72 0 0.00 0.00 0.00 11.00 ▇▁▁▁▁
category_two_defects 0 1 3.59 5.25 0 0.00 2.00 4.00 47.00 ▇▁▁▁▁
altitude_mean_meters 0 1 1347.24 448.25 100 1100.00 1320.00 1600.00 4287.00 ▂▇▁▁▁
plot_boxplot <- function(df, col){
  {{df}} %>% 
    ggplot(aes({{col}})) +
    geom_boxplot()
}

plot_boxplot(coffee_c, total_cup_points)
# remove the sample with cup points = 0
coffee_c %>% 
  filter(total_cup_points == 0)
# A tibble: 1 × 30
  total_cup_points species owner       country_of_orig… company region
             <dbl> <fct>   <fct>       <fct>            <fct>   <fct> 
1                0 Arabica bismarck c… Honduras         cigrah… comay…
# … with 24 more variables: number_of_bags <dbl>,
#   in_country_partner <fct>, grading_date <date>, variety <fct>,
#   processing_method <fct>, aroma <dbl>, flavor <dbl>,
#   aftertaste <dbl>, acidity <dbl>, body <dbl>, balance <dbl>,
#   uniformity <dbl>, clean_cup <dbl>, sweetness <dbl>,
#   cupper_points <dbl>, moisture <dbl>, category_one_defects <dbl>,
#   quakers <dbl>, color <fct>, category_two_defects <dbl>, …
coffee_d <- 
  coffee_c %>% 
  filter(total_cup_points != 0) %>% 
  select(-grading_date, -expiration,
         -owner, -company, -region, -species)

skim(coffee_d)
Table 3: Data summary
Name coffee_d
Number of rows 1053
Number of columns 24
_______________________
Column type frequency:
factor 7
numeric 17
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
country_of_origin 0 1.00 FALSE 35 Mex: 230, Gua: 154, Col: 145, Bra: 91
in_country_partner 0 1.00 FALSE 26 AME: 203, Spe: 173, Alm: 144, Aso: 133
variety 83 0.92 FALSE 27 Cat: 233, Typ: 207, Bou: 205, Oth: 98
processing_method 68 0.94 FALSE 5 Was: 731, Nat: 167, Sem: 52, Oth: 25
color 129 0.88 FALSE 4 Gre: 731, Blu: 81, Blu: 68, Non: 44
certification_body 0 1.00 FALSE 19 AME: 203, Spe: 173, Alm: 144, Aso: 133
harvest_year_num 12 0.99 FALSE 10 201: 292, 201: 216, 201: 171, 201: 128

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
total_cup_points 0 1 82.19 2.66 59.83 81.17 82.50 83.67 90.58 ▁▁▁▇▁
number_of_bags 0 1 156.10 128.56 0.00 18.00 200.00 275.00 1062.00 ▇▇▁▁▁
aroma 0 1 7.58 0.31 5.08 7.42 7.58 7.75 8.75 ▁▁▂▇▁
flavor 0 1 7.52 0.33 6.17 7.33 7.50 7.75 8.83 ▁▂▇▂▁
aftertaste 0 1 7.40 0.34 6.17 7.25 7.42 7.58 8.67 ▁▃▇▂▁
acidity 0 1 7.53 0.31 5.25 7.33 7.50 7.75 8.75 ▁▁▃▇▁
body 0 1 7.51 0.28 6.33 7.33 7.50 7.67 8.58 ▁▂▇▂▁
balance 0 1 7.51 0.35 6.08 7.33 7.50 7.67 8.75 ▁▂▇▃▁
uniformity 0 1 9.87 0.44 6.00 10.00 10.00 10.00 10.00 ▁▁▁▁▇
clean_cup 0 1 9.85 0.74 0.00 10.00 10.00 10.00 10.00 ▁▁▁▁▇
sweetness 0 1 9.93 0.42 1.33 10.00 10.00 10.00 10.00 ▁▁▁▁▇
cupper_points 0 1 7.48 0.41 5.17 7.25 7.50 7.75 10.00 ▁▂▇▁▁
moisture 0 1 0.09 0.04 0.00 0.10 0.11 0.12 0.20 ▂▁▇▁▁
category_one_defects 0 1 0.38 1.88 0.00 0.00 0.00 0.00 31.00 ▇▁▁▁▁
quakers 1 1 0.14 0.72 0.00 0.00 0.00 0.00 11.00 ▇▁▁▁▁
category_two_defects 0 1 3.59 5.25 0.00 0.00 2.00 4.00 47.00 ▇▁▁▁▁
altitude_mean_meters 0 1 1347.19 448.46 100.00 1100.00 1320.00 1600.00 4287.00 ▂▇▁▁▁
plot_boxplot(coffee_d, total_cup_points)
plot_boxplot(coffee_d, number_of_bags)
plot_boxplot(coffee_d, moisture)
plot_boxplot(coffee_d, category_one_defects)
plot_boxplot(coffee_d, quakers)
plot_boxplot(coffee_d, category_two_defects)

# Univariate factor variables

coffee_d %>% 
  ggplot(aes(quakers)) +
  geom_bar(stat = "count")
plot_bar <- function(col){
  coffee_d %>% 
    ggplot(aes({{col}})) +
    geom_bar(stat = "count") +
    ggtitle(enquo(col))
}

plot_bar(quakers)
plot_bar(category_one_defects)
plot_bar(category_two_defects)
coffee_d %>% 
  ggplot(aes(total_cup_points)) +
  geom_histogram()
plot_hist <- function(df, col){
  
  {{df}} %>% 
    ggplot(aes({{col}})) +
    geom_histogram() +
    ggtitle(enquo(col))
}


plot_hist(coffee_d, total_cup_points)
plot_hist(coffee_d, altitude_mean_meters)

coffee_d %>% 
  ggplot(aes(x = altitude_mean_meters,
             y = total_cup_points)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2),
              se = F,
              col = "red")

coffee_d %>% 
  ggplot(aes(x = moisture,
             y = total_cup_points)) +
  geom_point()

Dataset to be used for modelling:

coffee_cleaned <- coffee_d %>% 
  # mutate(log10_points = log10(total_cup_points)) %>% 
  mutate(variety = fct_explicit_na(variety)) %>% 
  
  mutate(processing_method = fct_explicit_na(processing_method)) %>% 
  
  mutate(color = na_if(color, "None")) %>% 
  mutate(color = fct_explicit_na(color)) %>% 
  
  select(total_cup_points,
         country_of_origin,
         in_country_partner,
         variety,
         processing_method,
         color,
         certification_body,
         harvest_year_num,
         moisture,
         category_one_defects,
         category_two_defects,
         quakers,
         altitude_mean_meters)
# plot variety by total cup points
coffee_cleaned %>% 
  ggplot(aes(fct_reorder(variety, total_cup_points), 
             total_cup_points)) +
  geom_boxplot() +
  coord_flip()
# create function
plot_ordered_boxplot <- function(df, col){
  df %>% 
    ggplot(aes(fct_reorder({{col}}, total_cup_points), 
               total_cup_points)) +
    geom_boxplot() +
    coord_flip() +
    ggtitle(enquo(col))
}


plot_ordered_boxplot(coffee_cleaned, country_of_origin)
plot_ordered_boxplot(coffee_cleaned, in_country_partner)
plot_ordered_boxplot(coffee_cleaned, variety)
plot_ordered_boxplot(coffee_cleaned, processing_method)
plot_ordered_boxplot(coffee_cleaned, color)
plot_ordered_boxplot(coffee_cleaned, certification_body)

# ggpairs-----
coffee_cleaned %>% 
  select(where(is.numeric)) %>% 
  ggpairs()
skim(coffee_cleaned)
Table 4: Data summary
Name coffee_cleaned
Number of rows 1053
Number of columns 13
_______________________
Column type frequency:
factor 7
numeric 6
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
country_of_origin 0 1.00 FALSE 35 Mex: 230, Gua: 154, Col: 145, Bra: 91
in_country_partner 0 1.00 FALSE 26 AME: 203, Spe: 173, Alm: 144, Aso: 133
variety 0 1.00 FALSE 28 Cat: 233, Typ: 207, Bou: 205, Oth: 98
processing_method 0 1.00 FALSE 6 Was: 731, Nat: 167, (Mi: 68, Sem: 52
color 0 1.00 FALSE 4 Gre: 731, (Mi: 173, Blu: 81, Blu: 68
certification_body 0 1.00 FALSE 19 AME: 203, Spe: 173, Alm: 144, Aso: 133
harvest_year_num 12 0.99 FALSE 10 201: 292, 201: 216, 201: 171, 201: 128

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
total_cup_points 0 1 82.19 2.66 59.83 81.17 82.50 83.67 90.58 ▁▁▁▇▁
moisture 0 1 0.09 0.04 0.00 0.10 0.11 0.12 0.20 ▂▁▇▁▁
category_one_defects 0 1 0.38 1.88 0.00 0.00 0.00 0.00 31.00 ▇▁▁▁▁
category_two_defects 0 1 3.59 5.25 0.00 0.00 2.00 4.00 47.00 ▇▁▁▁▁
quakers 1 1 0.14 0.72 0.00 0.00 0.00 0.00 11.00 ▇▁▁▁▁
altitude_mean_meters 0 1 1347.19 448.46 100.00 1100.00 1320.00 1600.00 4287.00 ▂▇▁▁▁

Tidymodels

I will only be using random forest to model, and the preprocessing recipe is quite simple as random forest is insensitive to skewed distributions. However, missing values must be imputed.

Split

set.seed(22071301)
coffee_split <- 
  coffee_cleaned %>% 
  initial_split()


coffee_train <- 
  coffee_split %>% 
  training()


coffee_test <- 
  coffee_split %>% 
  testing()

Preprocess

recipe_rf <- 
  recipe(total_cup_points ~.,
         data = coffee_train) %>% 
  step_impute_knn(all_predictors()) %>% 
  step_poly(altitude_mean_meters, degree = 2)

Fit

RF <- 
  rand_forest() %>% 
  set_args(mtry = tune(),
           min_n = tune(),
           trees = 1000) %>% 
  set_engine("ranger",
             importance = "permutation") %>% 
  set_mode("regression")

Tune

workflow_rf <- 
  workflow() %>% 
  add_recipe(recipe_rf) %>% 
  add_model(RF)

Cross-validate

set.seed(22071302)
CV <- 
  coffee_train %>% 
  vfold_cv(v=10)

tuned_rf <- 
  workflow_rf %>% 
  tune_grid(resamples = CV)

tuned_rf %>% 
  autoplot()
parameters_tuned_rf <- 
  tuned_rf %>% 
  select_best(metric = "rmse")

best_workflow_rf <- 
  workflow_rf %>% 
  finalize_workflow(parameters_tuned_rf)

fit_rf <- 
  best_workflow_rf %>% 
  last_fit(coffee_split)

Assess model

model_performance_rf <- 
  fit_rf %>% 
  collect_metrics()

model_performance_rf # rmse = 2.07, r-sq = 0.267
# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard       2.06  Preprocessor1_Model1
2 rsq     standard       0.269 Preprocessor1_Model1

Assess predictions

predictions_rf <- 
  fit_rf %>% 
  collect_predictions()

predictions_plot <- predictions_rf %>% 
  ggplot(aes(x = total_cup_points,
             y = .pred)) +
  geom_point(color = "midnightblue",
             alpha = 0.5) +
  geom_abline(color = "red",
              lty = 2) +
  labs(x = "Actual Total Cupping Points",
       y = "Predicted Total Cupping Points") +
  tune::coord_obs_pred()



plotly::ggplotly(predictions_plot)

The predictions were quite off for those with actual scores below 80, and the model tend to over-estimate the cupping scores.

Variable Importance

finalized_model <- 
  best_workflow_rf %>% 
  fit(coffee_cleaned)


feat_imp_model <- 
  RF %>% 
  finalize_model(select_best(tuned_rf)) %>% 
  set_engine("ranger",
             importance = "permutation")


feature_importance <- 
  workflow() %>% 
  add_recipe(recipe_rf) %>% 
  add_model(feat_imp_model) %>% 
  fit(coffee_train) %>% 
  extract_fit_parsnip() %>% 
  vip()

feature_importance

References:

Citation

For attribution, please cite this work as

lruolin (2022, June 25). pRactice corner: Predicting coffee cupping scores. Retrieved from https://lruolin.github.io/myBlog/posts/20220625 - Predict coffee cupping score/

BibTeX citation

@misc{lruolin2022predicting,
  author = {lruolin, },
  title = {pRactice corner: Predicting coffee cupping scores},
  url = {https://lruolin.github.io/myBlog/posts/20220625 - Predict coffee cupping score/},
  year = {2022}
}